asr output
Aligning ASR Evaluation with Human and LLM Judgments: Intelligibility Metrics Using Phonetic, Semantic, and NLI Approaches
Phukon, Bornali, Zheng, Xiuwen, Hasegawa-Johnson, Mark
Traditional ASR metrics like WER and CER fail to capture intelligibility, especially for dysarthric and dysphonic speech, where semantic alignment matters more than exact word matches. ASR systems struggle with these speech types, often producing errors like phoneme repetitions and imprecise consonants, yet the meaning remains clear to human listeners. We identify two key challenges: (1) Existing metrics do not adequately reflect intelligibility, and (2) while LLMs can refine ASR output, their effectiveness in correcting ASR transcripts of dysarthric speech remains underexplored. To address this, we propose a novel metric integrating Natural Language Inference (NLI) scores, semantic similarity, and phonetic similarity. Our ASR evaluation metric achieves a 0.890 correlation with human judgments on Speech Accessibility Project data, surpassing traditional methods and emphasizing the need to prioritize intelligibility over error-based measures.
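To make the limitation of error-based metrics concrete, the following minimal Python sketch computes WER as word-level edit distance divided by reference length. The function name `word_error_rate` and the example sentences are illustrative assumptions. They are not taken from the paper or from the Speech Accessibility Project data, and the sketch does not implement the proposed NLI/semantic/phonetic metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences (assumptions for illustration only).
reference = "please call my doctor"
repetition_error = "please c call my doctor"  # repeated onset, meaning preserved
meaning_error = "please call my lawyer"       # one substitution, meaning changed

print(word_error_rate(reference, repetition_error))  # 0.25
print(word_error_rate(reference, meaning_error))     # 0.25
```

Both hypotheses receive the same WER of 0.25, although only the second one changes what the speaker meant. This is the kind of mismatch that an intelligibility-oriented metric is intended to separate.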
ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features
Lin, Ye Bhone, Aung, Thura, Thu, Ye Kyaw, Oo, Thazin Myint
This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and from 51.56 to 43.59 after augmentation) and improved chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.
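As a quick reading aid, the reported averages can be converted into relative improvements. The short Python sketch below uses only the figures quoted in the abstract. The helper name `relative_reduction` is an assumption introduced for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction of an error metric, as a fraction of the baseline."""
    return (before - after) / before


baseline_wer = 51.56
wer_before_augmentation = 39.82
wer_after_augmentation = 43.59

# Relative WER reduction, in percent, rounded to one decimal place.
print(round(100 * relative_reduction(baseline_wer, wer_before_augmentation), 1))  # 22.8
print(round(100 * relative_reduction(baseline_wer, wer_after_augmentation), 1))   # 15.5

# Absolute chrF++ gain (higher is better).
print(round(0.627 - 0.5864, 4))  # 0.0406
```

In other words, the absolute drop of 11.74 WER points before augmentation corresponds to roughly a 22.8% relative reduction over the baseline ASR outputs.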
Speech Recognition on TV Series with Video-guided Post-ASR Correction
Yang, Haoyuan, Zhang, Yue, Jing, Liqiang, Hansen, John H. L.
Automatic Speech Recognition (ASR) has achieved remarkable success with deep learning, driving advancements in conversational artificial intelligence, media transcription, and assistive technologies. However, ASR systems still struggle in complex environments such as TV series, where multiple speakers, overlapping speech, domain-specific terminology, and long-range contextual dependencies pose significant challenges to transcription accuracy. Existing approaches fail to explicitly leverage the rich temporal and contextual information available in the video. To address this limitation, we propose a Video-Guided Post-ASR Correction (VPC) framework that uses a Video-Large Multimodal Model (VLMM) to capture video context and refine ASR outputs. Evaluations on a TV-series benchmark show that our method consistently improves transcription accuracy in complex multimedia environments.
Context-Enhanced Granular Edit Representation for Efficient and Accurate ASR Post-editing
Vejsiu, Luan, Zheng, Qianyu, Chen, Haoxuan, Han, Yizhou
Although ASR technology has been widely adopted by industry and by large portions of the population, ASR systems still make errors that require human editors to post-edit the text. While LLMs are powerful post-editing tools, baseline full rewrite models are inefficient at inference time because they regenerate large amounts of redundant, unchanged text. Compact edit representations exist but often lack the efficacy and context required for optimal accuracy. This paper introduces CEGER (Context-Enhanced Granular Edit Representation), a compact edit representation designed for highly accurate, efficient ASR post-editing. CEGER allows LLMs to generate a sequence of structured, fine-grained, contextually rich commands to modify the original ASR output. A separate expansion module deterministically reconstructs the corrected text based on the commands. In extensive experiments on the LibriSpeech dataset, CEGER achieves state-of-the-art accuracy, with the lowest word error rate (WER) compared with full rewrite and prior compact representations.
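The idea of emitting compact edit commands and expanding them deterministically can be sketched as follows. The abstract does not specify CEGER's actual command format, so the `Edit` structure (left context, old span, new span) and the `expand` function below are hypothetical stand-ins, not the authors' representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit command; not the format used by CEGER."""
    left: str  # context words immediately before the edited span
    old: str   # words to replace ("" means pure insertion)
    new: str   # replacement words ("" means deletion)


def expand(asr_text: str, edits: list[Edit]) -> str:
    """Deterministically apply edits, in order, to the ASR output."""
    words = asr_text.split()
    for edit in edits:
        left, old, new = edit.left.split(), edit.old.split(), edit.new.split()
        pattern = left + old
        # The context must pin down exactly one location in the text.
        hits = [
            i for i in range(len(words) - len(pattern) + 1)
            if words[i:i + len(pattern)] == pattern
        ]
        if len(hits) != 1:
            raise ValueError(f"edit {edit} matches {len(hits)} locations")
        start = hits[0] + len(left)
        words[start:start + len(old)] = new
    return " ".join(words)


asr_output = "the cat sat on the mat and the cat slept"
commands = [
    Edit(left="and the", old="cat", new="dog"),  # context disambiguates "cat"
    Edit(left="sat on", old="", new="top of"),   # insertion after the context
]
print(expand(asr_output, commands))
# the cat sat on top of the mat and the dog slept
```

In this toy example the model would only need to generate the two short commands rather than the full corrected sentence, and the left context is what makes the edit to the second "cat" unambiguous.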
Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis
Traditional audiometry often fails to fully characterize the functional impact of hearing loss on speech understanding, particularly supra-threshold deficits and frequency-specific perception challenges in conditions like presbycusis. This paper presents the development and simulated evaluation of a novel Automatic Speech Recognition (ASR)-based frequency-specific speech test designed to provide granular diagnostic insights. Our approach leverages ASR to simulate the perceptual effects of moderate sloping hearing loss by processing speech stimuli under controlled acoustic degradation and subsequently analyzing phoneme-level confusion patterns. Key findings indicate that simulated hearing loss introduces specific phoneme confusions, predominantly affecting high-frequency consonants (e.g., alveolar/palatal to labiodental substitutions) and leading to significant phoneme deletions, consistent with the acoustic cues degraded in presbycusis. A test battery curated from these ASR-derived confusions demonstrated diagnostic value, effectively differentiating between simulated normal-hearing and hearing-impaired listeners in a comprehensive simulation. This ASR-driven methodology offers a promising avenue for developing objective, granular, and frequency-specific hearing assessment tools that complement traditional audiometry. Future work will focus on validating these findings with human participants and exploring the integration of advanced AI models for enhanced diagnostic precision.
Large Language Models based ASR Error Correction for Child Conversations
Xu, Anfeng, Feng, Tiantian, Kim, So Hyun, Bishop, Somer, Lord, Catherine, Narayanan, Shrikanth
Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their applications in child speech including conversational scenarios are under-explored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promises and challenges of LLMs through experiments on two children's conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.
Integrating automatic speech recognition into remote healthcare interpreting: A pilot study of its impact on interpreting quality
Tan, Shiyi, Orăsan, Constantin, Braun, Sabine
Employing a within-subjects experiment design with four randomised conditions, this study utilises scripted medical consultations to simulate dialogue interpreting tasks. It involves four trainee interpreters with a language combination of Chinese and English. It also gathers participants' experience and perceptions of ASR support through cued retrospective reports and semi-structured interviews. Preliminary data suggest that the availability of ASR, specifically the access to full ASR transcripts and to ChatGPT-generated summaries based on ASR, effectively improved interpreting quality. Varying types of ASR output had different impacts on the distribution of interpreting error types. Participants reported similar interactive experiences with the technology, expressing their preference for full ASR transcripts. This pilot study shows encouraging results of applying ASR to dialogue-based healthcare interpreting and offers insights into the optimal ways to present ASR output to enhance interpreter experience and performance. However, it should be emphasised that the main purpose of this study was to validate the methodology and that further research with a larger sample size is necessary to confirm these findings.
MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula
Hyeon, Sieun, Jung, Kyudan, Won, Jaehee, Kim, Nam-Joon, Ryu, Hyun Gon, Lee, Hyuk-Jae, Do, Jaeyoung
In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i $\textit{side}$ of x), instead of the concise $\LaTeX{}$ format (i.e., $ e^{ix} = \cos(x) + i\sin(x) $), which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured $\LaTeX{}$ representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates $\LaTeX{}$ generation capabilities comparable to leading commercial Large Language Models (LLMs), while leveraging fine-tuned small language models of only 120M parameters. Specifically, in terms of CER, BLEU, and ROUGE scores for $\LaTeX{}$ translation, MathSpeech demonstrated significantly superior capabilities compared to GPT-4o. We observed a decrease in CER from 0.390 to 0.298, and higher ROUGE/BLEU scores compared to GPT-4o.
The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings
Ljubešić, Nikola, Rupnik, Peter, Koržinek, Danijel
Recent significant improvements in speech and language technologies come both from self-supervised approaches over raw language data as well as various types of explicit supervision. To ensure high-quality processing of spoken data, the most useful type of explicit supervision is still the alignment between the speech signal and its corresponding text transcript, which is a data type that is not available for many languages. In this paper, we present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages based on transcripts of parliamentary proceedings and their recordings. Our starting point is the ParlaMint comparable corpora of transcripts of parliamentary proceedings of 26 national European parliaments. In the pilot run on expanding the ParlaMint corpora with aligned publicly available recordings, we focus on three Slavic languages, namely Croatian, Polish, and Serbian. The main challenge of our approach is the lack of any global alignment between the ParlaMint texts and the available recordings, as well as the sometimes varying data order in each of the modalities, which requires a novel approach in aligning long sequences of text and audio in a large search space. The results of this pilot run are three high-quality datasets that span more than 5,000 hours of speech and accompanying text transcripts. Although these datasets already make a huge difference in the availability of spoken and textual data for the three languages, we want to emphasize the potential of the presented approach in building similar datasets for many more languages.
Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models
Stepachev, Pavel, Chen, Pinzhen, Haddow, Barry
Large language models (LLMs) have started to play a vital role in modelling speech and text. To explore the best use of context and multiple systems' outputs for post-ASR speech emotion prediction, we study LLM prompting on a recent task named GenSEC. Our techniques include ASR transcript ranking, variable conversation context, and system output fusion. We show that the conversation context has diminishing returns and the metric used to select the transcript for prediction is crucial. Formally, our approach explores suitable prompting strategies to perform speech emotion prediction from ASR outputs without speech signals. Most efforts are centred on creating a practical context for prompting. The contributions of this work are: Methodologically, we 1) select and rank ASR outputs as LLM input using multiple metrics and 2) exploit and fuse the conversation history and multiple ASR system [...]
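One simple way to rank the outputs of several ASR systems without access to the audio is to score each transcript by its average similarity to the others and prefer the most "central" one. The sketch below is an assumption for illustration only: the paper's actual ranking metrics are not given in the abstract, and the system names, example transcripts, and the use of `difflib.SequenceMatcher` are all stand-ins.

```python
from difflib import SequenceMatcher


def rank_transcripts(hypotheses: dict[str, str]) -> list[tuple[str, float]]:
    """Rank ASR outputs by mean word-level similarity to the other outputs."""
    tokens = {name: text.split() for name, text in hypotheses.items()}
    scores = {}
    for name, words in tokens.items():
        others = [w for other, w in tokens.items() if other != name]
        # ratio() is 2 * matches / total length, in [0, 1].
        sims = [SequenceMatcher(None, words, o).ratio() for o in others]
        scores[name] = sum(sims) / len(sims) if sims else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs from three ASR systems for the same utterance.
hypotheses = {
    "sys_a": "i am so happy today",
    "sys_b": "i am so happy to day",
    "sys_c": "eye am sew happy today",
}
for name, score in rank_transcripts(hypotheses):
    print(name, round(score, 3))
```

The top-ranked transcript (here `sys_a`) could then be placed first in the LLM prompt, with the remaining outputs supplied as alternatives for fusion.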